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Abstract: The Commission of the European Union, as well the United 
States Environmental Protection Agency, have set limit values for some 
pollutants in the ambient air that have been showrn to have adverse effects 
on human and environmental health. It is therefore important to identify 
regions where the probability of exceeding those limits is high. We propose 
a two-step procedure for estimating the probability of exceeding the legal 
limits that combines smoothing in the time domain with spatial interpo- 
lation. For illustration, we show an application to particulate matter with 
diameter less than 10 microns (PMio) in the North-Italian region Piemonte. 
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1. Introduction 

For many physical processes, it is important to model the probability of ex- 
ceeding specific thresholds, above which negative effects on human health and 
the environment have been observed. Examples include climatological processes 
(temperature, precipitation), and air pollution processes (ozone, fine particles). 

One way of estimating exceedances over high thresholds is given by method- 
ology using extreme value theory. A review of statistical techniques for analyzing 
extreme values can be found in Smith (1989), together with a detailed analysis 
of ozone time series. Davison and Smith (1990) analyze the generalized Pareto 
distribution used for modeling exceedances over high thresholds, and give ap- 
plications to river flows and wave heights. Tawn (1992) uses a modified version 
of the joint probabilities method for the analysis of extremes of non-stationary 
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sequences, with an illustration on estimation of extreme sea levels exceeded by 
the annual maximum hourly sea height with a prescribed probability. Niu (1997) 
discusses extremal properties of non-stationary time series, and shows an ap- 
plication to stratospheric ozone in Chicago, concerning the exceedance of daily 
maxima over the national standard. Picgorsch et al. (1998) present a thorough 
overview of statistical methodology for extreme values and rare events with 
focus on environmental applications. Chavcz-Dcmoulin and Embrcchts (2004) 
describe smooth non-stationary generalized additive models for extremes, and 
incorporate splines in a model for exccedances over high thresholds applied to 
finance and insurance problems. 

An alternative, more general approach for modeling exceedances, is via method- 
ology in time series and spatial statistics for probability distribution functions 
and quantiles. Also, it is often the case that environmental data have large tem- 
poral coverage (hourly or daily measurements over long periods), and smaller 
spatial coverage (a relatively small number of monitoring locations) . Such data 
can be viewed as a collection of long time series that are spatially correlated. 
The problem of interest is to map the exceedance probabilities over a fixed 
threshold. Many times, the observed time series display different patterns at 
the different monitoring sites, and show significant departures from stationarity 
and Gaussianity. 

In this note we introduce a two-stage exploratory technique that combines 
smoothing in the time domain with spatial interpolation, to produce maps of ex- 
ceedance probabilities for the spatial domain of interest. Motivated by the good 
temporal coverage of the data, and in order to provide a flexible, comprehensive 
approach, we start by smoothing the observed or 1 exceedance probabilities, 
then interpolate the resulting estimates by using a parametric spatial covariance 
function. 

A flexible way to characterize complex temporal dependence, is via a time- 
varying transformation G(t,Zt) of a stationary process Zt. The physical evo- 
lution of many real-life processes makes this a plausible working assumption. 
As the unknown transformation G is allowed to vary with time, the probability 
distribution function of the resulting process may also change, and therefore 
the process may not be stationary. Ghosh ct al. (1997) studied the asymptotic 
properties of a nonparametric estimator of the marginal probability distribu- 
tion function in this setting, where the underlying process Zt was assumed to 
be Gaussian and having long memory. A similar estimator was analyzed in 
Draghicescu (2003), Draghicescu and Ghosh (2003) for the case when the un- 
derlying Gaussian process has short memory (under the general assumption that 
the correlations are summable) . A data-driven procedure for optimal bandwidth 
selection for these kernel estimators was proposed in Ghosh and Draghicescu 
(2002), and discussed in detail in Draghicescu (2003). A new kernel distribution 
function estimator was recently discussed in Swanepocl and Van Graan (2005), 
where a data-based choice of bandwidth was also proposed. The method holds 
for independent, identically distributed (iid) data, and can be extended for 
weakly dependent observations. Bosq (1998) provides a comprehensive overview 
of nonparametric methods for stochastic processes. However, there are many 
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open problems regarding spatial modeling of distribution functions. One imme- 
diate, albeit naive approach, is given by the so-called indicator kriging (Chiles 
and Delfiner 1999, page 383), which is an adaptation of universal kriging (spatial 
interpolation). Recently Short ct al. (2005) introduced a fully hierarchical ap- 
proach for modeling distribution functions for bivariate spatial processes, using 
a Bayesian framework implemented via MCMC methods. 

The Commission of the European Union, as well as the United States Envi- 
ronmental Protection Agency, have set limit values for some pollutants in the 
ambient air, that were proved to have a negative impact on human and envi- 
ronmental health. In particular, recent studies linked traffic-related pollutants 
to increased risks of morbidity and mortality due to respiratory and cardiovas- 
cular illness (see for example Samct et al. 2(J0(J and the references therein). It is 
therefore important to identify regions where the probability of exceeding these 
legal limits is high. 

In this paper, we focus on particulate matter with diameter less than 10 
microns (PMio). Exploratory analyses and basic statistical models for this pollu- 
tant are employed in McKendry (2000), Ignaccolo and Nicolis (2005), 
Rajsic et al. (2004), among others. A review of recent studies on particulate 
matter is given in Schimek (2003), where a semi-parametric model including 
weather information is used to link particulate matter to hospital admissions in 
a regional study, controlling for potential spatial dependencies. In contrast to 
these studies, where PMio is modeled directly, we focus on the probability of ex- 
ceeding the legal limit. We use a two-stage procedure to estimate the space-time 
exceedance probability over a given threshold. The paper is organized as follows. 
The data set that motivated this study is described in Section 2. Section 3 is 
devoted to statistical methodology, followed by Monte Carlo simulations in Sec- 
tion 4, and an application to air pollution data in Section 5. A brief discussion 
is given is Section 6. 

2. A motivating data set 

We analyze daily PMio concentrations (in fig/m^) during 2004 at 22 sites in 
the North-Italian region Piemonte. The data were collected through the infor- 
mation system AriaWeb Regione Piemonte. A detailed description of this mon- 
itoring network can be found in Ignaccolo and Nicolis (2005). The 22 records 
used in this study were selected from Low Volume Gravimetric (LVG) monitors, 
such that the amount of missing data did not exceed 10%. The missing values 
were imputed by using kernel regression smoothing with adaptive plug-in band- 
width (Gasscr et al. 1991). The maximum allowable number of days to exceed 
50 ng/m^ is 35, and therefore 50 [iglrr? is the threshold corresponding to the 
0.904 quantile during one year. Detailed explanations of the European norms for 
air pollution are given in van Aalst et al. (1998). Figure 1 shows the locations 
of these 22 sites, together with the respective 0.904 quantiles. 

It can be observed that the sites near the Alps have lower PMio values, 
whereas higher pollution levels are detected in the valleys, closer to urban ar- 
eas. Regarding the temporal variations of this pollutant, we found that the 
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behavior of PMio has different patterns at different locations, yielding different 
patterns in the behavior of the probability of exceeding 50 fig/m^. Given the 
good temporal coverage of the data, we employ smoothing in the time domain 
to first model these exceedance probabilities in a flexible and comprehensive 
way. For an illustration, we show in Figure 2 the series of daily PMio concen- 
trations for year 2004 at three locations, the corresponding or 1 probabilities 
of exceeding 50 fig/rir^, and the associated smoothed exceedance probabilities. 

With respect to the temporal dependence structure, these 22 time series 
displayed short-range dependence, as shown by the boxplots of their autocorre- 
lation functions in Figure 3. 

The same data set is used in Bande, Ignaccolo and Nicolis (2006), where ad- 
ditional meteorological and geographical information is included in a model 
for the space-time PMio trend (mean function). With regard to modeling the 
probability of exceeding the cutoff PMio legal value, the preliminary study 
Draghicescu and Ignaccolo (2005) proposes a sequential modeling technique for 
the maps of exceedance probabilities in the same region, using data from 17 
monitors for the year 2003. This method will be briefly described in Section 4, 
and used for comparisons in the simulations and applications. 
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Fig 2. Time series of daily PMio concentrations, year 2004 (I'^ft)r corresponding or 1 
series representing the 50 ^ig/m^ exceedance probabilities (middle), and smoothed exceedance 
probabilities (right) at station 5 (top), station 19 (middle), and station 11, (bottom). 



3. Theoretical framework 

Assume that at each location s £ D (for some domain D <E 'R?) we ob- 
serve a temporal process of the form Xs{t) — Gs{t, Zs{t)), where Gg is an 
unknown transformation, Zs is a standardized stationary Gaussian process with 
7s(0 := cov{Zs{t), Zs{t+l)), such that X^i^-oo l7s(OI < This general class of 
processes includes non-stationary and non-Gaussian situations, and is therefore 
suitable to model complex environmental data sets. Also note that no paramet- 
ric assumptions are made on the temporal covariance structure of the process. 
For fixed a;o G R, define the exceedance probability Pxo{t, s) = P{Xs{t) > xq). 
By using the axioms of probability, it is immediate to see that P^o (t, s) takes 
values in [0, 1] and is non-increasing in xq. The problem of interest is to predict 
Pxo (^1 s*) at location s* £ D where there arc no observations, and at any time t, 
based on observations of the process Xs{t) at n time points ii, . . . , t„, and at m 
spatial locations si, . . . , s„i- Typically m is much smaller than n. Furthermore, 
for fixed t, these exceedance probabilities are assumed to be isotropic in space, 
meaning that their spatial covariances depend only on the Euclidean distance 
between the respective locations. 
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Fig 3. Boxplots (over the 22 locations) of the empirical autocorrelation functions j or the time 
series of daily PMio at these locations, year 2004- 



The method we propose consists of two steps. We start by niodehng the tem- 
poral exceedance probabihties for any location where the process is observed. 
Let Xs{ti)^ . . . , Xs{tn) be the daily PMio observed concentrations, assumed to 
be realizations of a process Xs{t) = Gs{t, Zs{t)), and xq the PMio legal limit. 
We stress again that the true exceedance probability may change with time and 
space. Therefore the empirical estimator P„,m(xo) := J2i=i YlT=i ^i^-j {ti)>xo} 
(analogous to the empirical distribution function) is too rough, and cannot cap- 
ture such changes, making it necessary to consider different weights, that should 
depend on the local behavior of the process. As the problem of interest is not to 
assess the mean function of the process, but the exceedance probability over the 
given threshold xo, we focus on the indicator process l{Xa(t)>a:o}- We assume 
further that the changes of this indicator process with time are smooth for all 
s E D, namely that l^Xs{ti)>xo}^'^xo{tii s) = X]fc=i '^k,s,xo {U)Hk{Zs{U)), where 
ti = are rescaled time points, denotes the Hermite polynomial of degree k, 
and the coefficients Ck,s,xo s-re twice continuously differentiable with respect to t, 
and continuous with respect to s and Xg . For examples of time- varying transfor- 
mations of Gaussian processes and detailed explanations of assumptions similar 
to the ones above, we refer to Draghicescu (2003). In the second step we use 
spatial interpolation to predict the exceedance probability over the specified 
threshold at any location in the region of interest. 
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3. 1 . Estimation of temporal exceedance probabilities 

In the initial step we model the temporal risks non-parametrically, by using the 
Nadaraya- Watson kernel estimator 

, . _ Er=i-^(V)i{x.(t.)>.o} 

where iiT is a kernel function. For details on kernel smoothing we refer to 
Wand and Jones (1995). Note that the temporal bandwidth bt should not de- 
pend on the threshold xq, in order for the resulting estimator to be non- increasing. 
In what follows, to keep notation simple, we write b instead of bt. 

Theorem 3.1. In the above notation, if ^ [P^oitj s)] exist a.e. in [0, 1] and if 

[Vo,'>'[l{x,(t)>xo})] < n- ^ oo, 5^0 and nb oo, for all s £ D and 

fixed Xq € R, for the estimator (1) we have 

(a) Consistency: 

MSE(P,,,{t,s)'^ ^o(^max(^b\^^y (2) 

(b) Asymptotic normality: 

P,Jt,s)-Ep,,it,s) 



,JVar{P,„{t,s)) 



N{0,1). (3) 



Sketch of the proof. By using Taylor expansions and integral approxima- 
tions of the sums over the kernel function, it follows that Bias(Pxo{t: s)) = 
eP,„ (i, s)-P,o (t, s) = B{t, s, XQ)b^+o{b^), where B{t, s, xq) = i ^ [P^o «)] * 
intu'^K{u)du. Also, Var{Pxg{t, s)) = j^V{t,s,Xo)+o(j^), where V{t,s,Xo) 
is bounded. Specifically, Ghosh and Draghicescu (2002) prove that there exist a 
temporal covariance function with the lag-k covariance denoted by g{k, t, t') such 
that Vit,s,xo) = EtiE"=iKC-^)KC-^)gi\i- j\,t,t'). Consistency then 
follows from Chebyshev's inequality. The asymptotic normality is an immediate 
application of results in Brcuer and Major (1983). 

Remark 3.1. Estimator (1) has the same rate of convergence as in the iid case. 

Remark 3.2. An optimal bandwidth can be obtained by minimizing the mean 
squared error of the estimator (2). Note that, as in the iid case, bopt ~ Cn~'s . 
In practice, an optimal bandwidth (local or global) can be obtained by using 
plug-in estimators (approximations) of the bias and variance (see for example 
Ghosh and Dragliiccscn 2002, Draghicescu 2003). While local (time-dependent) 
bandwidths have the advantage of dealing with edge effects (at the ends of the 
time interval), it is often the case that global (integrated over time) bandwidths 
considerably decrease computational time, without significant change in the 
resulting estimators. 
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Remark 3.3. In practice Var (P^oitj s)) can be estimated by smoothing the 
empirical covariances, that provide good approximations for the above g{k, t, t'), 
and used to produce confidence bands based on the asymptotic normahty of 
Px„(t, s). In our preUminary analyses of the PMio data, this approach yielded 
very narrow bands. 



3.2. Spatial interpolation 

In the second step, wc use use spatial interpolation (universal kriging) to predict 
the exceedance probability field at a location s* e D where there are no obser- 
vations, under the assumption of space-time separability. Wc model the spatial 
covariances |si — Sj| |) :— Cov(P{t, sj, P(t, Sj)) paramctrically. by using the 
Matern stationary covariance model 

^*(ll^^--^^ll) = 2^^i-TrR(, Vt ) Vt ) 

for fixed t. Our analyses did not detect major violations of spatial isotropy 
(that is, the spatial correlations only depend on distance, and not on direc- 
tion). The threshold a;o is also fixed. To keep notation simple, we omitted it 
in expression (4). r(-) is the usual gamma function and /Ciyj(-) is the modified 
Bessel function of the third kind of order ft (Abramowitz and Stcgmi 1972). 
The parameter i/t > characterizes the smoothness of the process, at denotes 
the variance of the transformed random field, and pt measures how quickly the 
correlations of this field decay with distance. These parameters arc estimated 
by maximum likelihood. Then, the best linear unbiased predictor (BLUP) of 
the exceedance probability field at Sq & D, is given by a linear combination 
P*(t, So) = Si^i where the weights A^, 1 < z < m are completely 
determined by the covariance parameters vtiPti a-nd CTj. The standard error of 
P*(t, So) can be also expressed in terms of the interpolation parameters A^. 
This procedure is known as universal kriging (Stein 1999), and implemented in 
many software packages. In practice, the covariance parameters are estimated 
from the same data. To account for their uncertainty, the standard errors of 
P*(t, So) need to be adjusted. This can be done by using conditional simulation 
techniques (Stein 1999, Chapter 6), or resampling schemes (Lahiri 2003). 

Remark 3.4. Linear interpolation does not guarantee that the resulting ex- 
ceedance probability estimator/predictor takes values in the interval [0,1]. A 
1 : 1 transformation (such as x ^ rfrjs^) '^^'^ used first, then perform inter- 
polation on the transformed field, and finally invert to obtain the desired ex- 
ceedance probabilities. However, this technical detail did not provide improved 
maps for the present study. 



4. Numerical simulations 



In this section we present a simulation study that was carried out in order 
to analyze the performance of the exceedance probability estimator P*jj(t, sq) 
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Fig 4, One realization of the simulated spatial field at fixed time point t = 100 (left), su- 
perimposed locations of 25 randomly chosen sites; one realization of the process at location 
marked "X" (right). 



introduced in Section 3. Wc simulated stationary isotropic space-time processes 
on a 20 X 20 grid and 200 time points. We considered a separable covariance 
model factorized as a Whittle-Matern spatial covariance by a stable temporal 
covariance, C{u,h) = CT{u)Cs{h) = cryexp(— u") * a'^jq^h'^ K..y{h), where 
u = \ti ~ tk\ is the time lag and h = \\si — Sj|| is the spatial distance between 
two sites, r(-) is the usual gamma function, and /C'y(-) is the modified Bessel 
function of the third kind of order 7. The computations were done in R, using 
the function GaussRF in the library RandomFields. We used a\ = 0.7, a\ = 1.3, 
a = 0.2, and 7 = 0.5. One realization of the process is shown in Figure 4. 

In order to assess the performance of the mapping procedure described in Sec- 
tion 3, we used the following three methods. They are all sequential, the second 
step being the same in all three cases (universal kriging with Matcrn covariance 
function). First method is indicator kriging, that is spatial interpolation of the 
empirical exceedance probabilities. It will be referred to as IND. The next ap- 
proaches require a preliminary smoothing step in the temporal domain. Thus, 
the second method, referred to as EDF, is the two-step procedure introduced in 
Draghicescu and Ignaccolo (2005). For every time point t^, a weight is assigned 
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Table 1 

Means and standard deviations over 100 simulations of root mean squared errors 



E200 (P^Q(ti,so)'- 



-,(ti,»o))^ 



at location marked "X" in Figure 4, based on data at the 



randomly selected sites (m=24), and on data on the whole grid (m=400). Method IND; 
spatial interpolation of the observed and 1 exceedance probabilities; method EDF; spatial 
interpolation of smoothed exceedance probabilities with weights proportional to the empirical 
distribution function; method KER; spatial interpolation of estimated exceedance 
probabilities obtained by kernel smoothing 





xo = 

m = 24 m = 400 


xo = 2 

m = 24 m = 400 


IND 
EDF 
KER 


0.772 (0.0698) 0.0154 (0.1320) 
0.0095 (0.0983) 0.0098 (0.7095) 
0.0083 (0.0572) 0.0071 (0.0352) 


0.1001 (0.0759) 0.078 (0.0521) 
0.0542 (0.0930) 0.0327 (0.0198) 
0.0493 (0.0087) 0.0278 (0.0065) 



to the observed exceedance probability or 1, corresponding to the order of the 
quantile of the PMio observation for that time point (i.e. the empirical distri- 
bution function EDF). For example, if the observed value is 80, and 75% of the 
data are less than 80, the respective weight will be 0.75, that is EDF(80). Then 
the exceedance probability is estimated by a weighted average of the observed 
exceedances on a time window centered at t^. In the simulations, applications 
and validation study in next section, the window width was fixed and equal to 
7 time points. The third method, called KER is based on kernel smoothing, as 
described in detail in Section 3. 

Table 1 shows the means an d standard deviations over 100 simulations of 

the root mean squared errors " °^20o''°^ " location marked 

"X" in Figure 4, for two thresholds, xq = (corresponding to the median), and 
xq = 2 (corresponding to the 0.9 quantile). For each time point ti, the predictor 
P*p(ti, So) was computed based on data at the randomly selected sites (m=24), 
as well as on data on the whole grid (m=400). It can be seen that for both 
thresholds these root mean squared errors are lowest for KER and largest for 
IND. 



5. Maps of exceedance probabilities for PMio in Piemonte 

We apphed the three methods previously described to the Piemonte PMio data. 
We estimated the 50 ng/m? exceedance probabilities for each day of the year 
2004. 

Figure 5 displays these exceedance probabilities on February 18, 2004 (left), 
and February 19, 2004 (right), based on the aforementioned methods. On Febru- 
ary 18 the maximum value of PMio for 2004 was observed at all sites, with a 
large decrease the following day. In this case IND does not seem to work well, 
the maps are very rough, and there is an abrupt change from a day to the next. 
The unshaded region on the February 19 map is due to negative values of the 
predicted exceedance probabilities, that, for the applied goal of this study, can 
be viewed as zeros. EDF yields maps with high values on almost all the region, 
while KER shows higher values around the Torino area. As observations outside 



D. Draghicescu and R. I gnaccolo/ Modeling threshold exceedance probabilities 159 




300 350 400 4 50 500 300 350 4O0 450 500 



Fig 5. Estimated 50fig/m'^ exceedance probabilities on February 18, 2004 fl^ft) o,nd February 
19, 2004 (right); IND (top), EOF (middle), KER (bottom). 



D. Draghicescu and R. I gnaccolo/ Modeling threshold exceedance probabilities 160 




Fig 6. Estimated 50 fig /m-^ exceedance probabilities averaged over summer (left) and winter 
(right); IND (top), EDF (middle), KER (bottom). 
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Fig 7. Boxplots (over the 22 locations) of the root mean squaerd prediction errors for 50 
lig/rrfi exceeding probabilities: IND (left), EDF (middle), KER (right). 



Piemonte were not available, the estimated exceedance probabilities near the 
boundaries tend to have larger errors. 

While these methods provide maps of point estimates, and could give a good 
understanding of the process (by using animation techniques for example), it is 
often the case that transformations of point estimates are also of interest. For 
instance, seasonal averages are very informative summaries for both scientists 
and policy makers. Figure 6 shows estimated exceedance probabilities over 50 
jjig/m^ averaged over summer (left) and winter (right), obtained through the 
three methods. Summer is considered the period between April 1 and September 
30. From a geographical prospective these methods yield maps that are not 
very different. However, from a statistical point of view, EDF or KER have the 
advantage of making use of all the information in the data. As expected, in both 
seasons the highest risks are around the Torino area, with larger values during 
the winter. 

To assess the performance of these methods, we carried out a cross-validation 
study. We used the leave-onc-out principle and estimated the daily 50 iig/m? 
exceedance probabilities at each site, based on observations at the remaining 
21 sites for each day. The boxplots in Figure 7 display the root mean squared 
prediction errors obtained by leaving out that site, computed as the square root 
of the averages (over the 366 days of the year 2004) of the differences in squared 
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exceedance probabilities ("observed" - predicted). Here, the "observed" values 
are or 1 for IND, and the corresponding smoothed values (after first step) for 
EDF and KER, respectively. As expected, IND performs the worst, while KER 
yields the smallest errors. 

6. Discussion 

The methodology proposed in this paper provides a good descriptive and visual 
tool for modeling threshold exceedance probabilities based on space-time data 
with large temporal and relatively small spatial coverage. It is statistically accu- 
rate, computationally fast, and flexible enough to be suitable for processes with 
complex space-time dependencies in many applied fields, such as environmental 
science and management, atmospheric sciences, ecology, epidemiology, finance, 
medicine. The method KER, involving kernel smoothing in time, followed by 
spatial interpolation, was proved to provide the most accurate and informative 
maps of exceedance probabilities. 
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